What is the Red Wine?: Red wine is a type of wine made from dark-colored (black) grape varieties. The actual color of the wine can range from intense violet, typical of young wines, through to brick red for mature wines and brown for older red wines. The juice from most purple grapes is greenish-white; the red color comes from anthocyan pigments (also called anthocyanins) present in the skin of the grape; exceptions are the relatively uncommon teinturier varieties, which produce a red colored. Much of the red-wine production process therefore involves extraction of color and flavor components from the grape skin.
What is the dataset?: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Data Overview: This dataset is public available for research. The details are described in [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm3)
11 - alcohol (% by volume)
12 - quality (score between 0 and 10)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
As listed above, the data contains 1599 obs. of 13 variables. The first variable (X) is useles in our analysis. So, we are going to delete it.
Light: Alcohol < 9
Medium: Alcohol >= 9 and Alcohol <= 12
Strong: Alcohol > 12
So, We are going to add new column to categorize the wine type if it’s Light, Medium or Strong.
##
## Light Medium Strong
## 37 1421 141
As we see, Light = 37, Medium = 1421 and Strong = 141
Poor: Quality < 9
Good: Quality >= 9 and Quality <= 12
Excellent: Quality > 12
So, We are going to add new column to rank the wine type if it’s Poor, Good or Excellent
##
## Excellent Good Poor
## 18 837 744
As we see, Excellent = 18, Good = 837 and Poor = 744
Tip: Here, we are going to perform some preliminary exploration of Red Wines dataset.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality category ranking
## Min. :3.000 Light : 37 Excellent: 18
## 1st Qu.:5.000 Medium:1421 Good :837
## Median :6.000 Strong: 141 Poor :744
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The result above is listing basic statistics about each variable.
The fixed.acidity is skewed to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The above table shows a summary of the fixed.acidity variable.
The citric.acid also is skewed to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The above table shows a summary of the citric.acid variable.
The pH is normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The above table shows a summary of the pH variable.
The chlorides is skewed to the right. We are going to apply the 10th log to it.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of chlorides is normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The above table shows a summary of the chlorides variable.
The residual.sugar is skewed to the right. We are going to apply the 10th log to it.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of residual.sugar is normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The above table shows a summary of the residual.sugar variable.
The density is normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The above table shows a summary of the density variable.
We can see that most of alcohol exists from 9 to 11.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The above table shows a summary of the alcohol variable.
The free.sulfur.dioxide is skewed to the right. We are going to apply the 10th log to it.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of free.sulfur.dioxide isn’t normally distributed. It seems to be like a bimodal distribution. Because free.sulfur.dioxide near to 9 is very low.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The above table shows a summary of the free.sulfur.dioxide variable.
The total.sulfur.dioxide is skewed to the right. We are going to apply the 10th log to it.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of total.sulfur.dioxide is normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The above table shows a summary of the total.sulfur.dioxide variable.
The volatile.acidity is normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The above table shows a summary of the volatile.acidity variable.
Red Wine dataset contains 1599 records. Also, it has 14 variables. And this is the structure
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "category" "ranking"
For sure, main interest is in quality variable. We are going to see how it get affected by variables like percentage of alcohol, density and chlorides.
Category and ranking will help me in the investigation into my features of interest.
Yes, as listed above. I created the category of wine [Light, Medium, Strong] and rating of wine [Poor, Good, Excellent]
Some variabels like [residual.sugar & alcohol] are skewed to the right. So, I made a log10 transformation to them to be normally distributed. I did because they always affect the quality.
We are going to group our sample based on the quality of it. We will have the mean volatile.acidity and median volatile.acidity also the number of occurence.
## # A tibble: 6 x 4
## quality mean_volatile.acidity median_volatile.acidity n
## <int> <dbl> <dbl> <int>
## 1 3 0.8845000 0.845 10
## 2 4 0.6939623 0.670 53
## 3 5 0.5770411 0.580 681
## 4 6 0.4974843 0.490 638
## 5 7 0.4039196 0.370 199
## 6 8 0.4233333 0.370 18
And here is a summary about the new variable rw.quality
## quality mean_volatile.acidity median_volatile.acidity
## Min. :3.00 Min. :0.4039 Min. :0.3700
## 1st Qu.:4.25 1st Qu.:0.4419 1st Qu.:0.4000
## Median :5.50 Median :0.5373 Median :0.5350
## Mean :5.50 Mean :0.5800 Mean :0.5542
## 3rd Qu.:6.75 3rd Qu.:0.6647 3rd Qu.:0.6475
## Max. :8.00 Max. :0.8845 Max. :0.8450
## n
## Min. : 10.00
## 1st Qu.: 26.75
## Median :126.00
## Mean :266.50
## 3rd Qu.:528.25
## Max. :681.00
Our dataset contains some outlier, so we used median. We can see that as the quality increses, the volatile acidity decreases.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 -0.256 0.672
## volatile.acidity -0.256 1.000 -0.552
## citric.acid 0.672 -0.552 1.000
## residual.sugar 0.115 0.002 0.144
## chlorides 0.094 0.061 0.204
## free.sulfur.dioxide -0.154 -0.011 -0.061
## total.sulfur.dioxide -0.113 0.076 0.036
## density 0.668 0.022 0.365
## pH -0.683 0.235 -0.542
## sulphates 0.183 -0.261 0.313
## alcohol -0.062 -0.202 0.110
## quality 0.124 -0.391 0.226
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.115 0.094 -0.154
## volatile.acidity 0.002 0.061 -0.011
## citric.acid 0.144 0.204 -0.061
## residual.sugar 1.000 0.056 0.187
## chlorides 0.056 1.000 0.006
## free.sulfur.dioxide 0.187 0.006 1.000
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 -0.022
## pH -0.086 -0.265 0.070
## sulphates 0.006 0.371 0.052
## alcohol 0.042 -0.221 -0.069
## quality 0.014 -0.129 -0.051
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.113 0.668 -0.683 0.183 -0.062
## volatile.acidity 0.076 0.022 0.235 -0.261 -0.202
## citric.acid 0.036 0.365 -0.542 0.313 0.110
## residual.sugar 0.203 0.355 -0.086 0.006 0.042
## chlorides 0.047 0.201 -0.265 0.371 -0.221
## free.sulfur.dioxide 0.668 -0.022 0.070 0.052 -0.069
## total.sulfur.dioxide 1.000 0.071 -0.066 0.043 -0.206
## density 0.071 1.000 -0.342 0.149 -0.496
## pH -0.066 -0.342 1.000 -0.197 0.206
## sulphates 0.043 0.149 -0.197 1.000 0.094
## alcohol -0.206 -0.496 0.206 0.094 1.000
## quality -0.185 -0.175 -0.058 0.251 0.476
## quality
## fixed.acidity 0.124
## volatile.acidity -0.391
## citric.acid 0.226
## residual.sugar 0.014
## chlorides -0.129
## free.sulfur.dioxide -0.051
## total.sulfur.dioxide -0.185
## density -0.175
## pH -0.058
## sulphates 0.251
## alcohol 0.476
## quality 1.000
We can see that there is a highe correlation between alcohol and quality.
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'wrap' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
The above visualization shows a strong correlation between quality & alcohol. Also, it shows negtibe correlation between volatile.acidity & quality and cetric.acid & pH.
Now, we will check relation between pH and alcohol.
## `geom_smooth()` using method = 'gam'
The above visualization shows low correlation, so we are going to check it with correlation test
##
## Pearson's product-moment correlation
##
## data: rw$pH and rw$alcohol
## t = 8.397, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1582061 0.2521123
## sample estimates:
## cor
## 0.2056325
As expected, now much correlation between them.
The above visualization shows some correlation, so we are going to check it with correlation test
##
## Pearson's product-moment correlation
##
## data: rw$density and rw$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
We can see that there is a negtive correlation between [density, alcohol].
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Removed 7 rows containing non-finite values (stat_smooth).
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing missing values (geom_smooth).
The above visualization shows outlaier, so we are going to check it with correlation test
##
## Pearson's product-moment correlation
##
## data: rw$chlorides and rw$alcohol
## t = -9.0617, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2672644 -0.1740057
## sample estimates:
## cor
## -0.2211405
We can see that there is no much correlation between [chlorides, alcohol].
The relationship between alcohol & density, When alcohol increases, density decreases.
I was expecting many strong relationships, but the dataset shows nothing.
Tip: Now it’s time to put everything together. Based on what you found in the bivariate plots section, create a few multivariate plots to investigate more complex interactions between variables. Make sure that the plots that you create here are justified by the plots you explored in the previous section. If you plan on creating any mathematical models, this is the section where you will do that.
The above two visualizations show that the excellent wine has a higer alcohol.
I can clrearly see that alcohol and density affact the ranking of the wine.
Ranking with other variables.
We can see that most of the sample quality are between [5, 6].
We can see that as the density increses, the pH decreses. Which meaen there is a negtive correlation.
A strong correlation, as alcohol increase, density decrease.
We can see that alcohol & density are the best indicators for the wine quality. This will help all people who drinks red wine to take better decisions based on chemical details.